DESCRIPTION

Syllabic Nuclei Extracting Apparatus and Program Product Thereof

The present invention generally relates to a technique for extracting
a portion representing characteristics of the waveform from a speech
waveform with high reliability, and more specifically, to a
technique for extracting an area, from the speech waveform, effective to
estimate with high reliability a state of a source of the speech waveform.

Background Art 
[Word Definition 1]

First, words used in this section will be defined.
"Pressed sound" refers to a sound produced with one's glottis closed
tight, so that the air does not smoothly flow through the glottis and the
acceleration of the airflow passing through the glottis becomes large. Here,
the glottal flow waveform is much deformed from a sine curve, and a
gradient of its differential waveform locally becomes large. When a speech
has such characteristics, the speech will be referred to as "pressed speech."
"Breathy sound" refers to a sound produced with one's glottis opened
and not tight, so that air flows smoothly and, as a result, the glottal flow
waveform becomes closer to a sine curve. Here, the gradient of the
differential waveform of the glottal flow waveform does not locally become
large. When a speech has such characteristics, the speech will be referred
to as "breathy speech."
"Modal sound" refers to a sound between the pressed and breathy sounds.
"AQ (Amplitude Quotient)" is a peak-to-peak amplitude of the glottal
flow waveform divided by the amplitude of the minimum of the flow
derivative.

Speech synthesis is as important a field of phonetic study as speech
recognition. Recent developments in signal processing technology have promoted
use of speech synthesis in many fields. Conventional speech synthesis is,
however, simple production of speech from text information, and subtle
emotional expression observed in human conversation cannot be expected.
By way of example, human conversation transmits information such
as anger, joy and sadness through vocal sound and the like, other than the
information of the speech contents. Information other than the language,
accompanying the speech, will be referred to as paralinguistic information.
Such information cannot be represented with text information only. In
conventional speech synthesis, however, it has been difficult to transmit
such paralinguistic information. For higher efficiency of man-machine
interfaces, it is desirable to transmit not only the text information but also
the paralinguistic information at the time of speech synthesis.

As a solution to this problem, continuous speech synthesis in various
utterance styles has been proposed. A specific approach is as follows.
Speech is recorded and converted to a data-processable form to prepare a
database, and speech units in the database that are considered to express
desired features (such as anger, joy, and sadness) are labeled
correspondingly. At the time of speech synthesis, a speech unit having a label
corresponding to the desired paralinguistic information is utilized.
However, the preparation of a database with sufficient coverage of
speaking styles necessarily implies processing of huge amounts of recorded
speech. Therefore, automatic feature extraction and labeling without
operator supervision must be ensured.

Examples of the paralinguistic information are as follows. One aspect of
speaking style is the distinction between pressed sound and
breathy sound. The pressed sound is produced rather strongly, because
the glottis is tight. The breathy sound is not perceived as strong, because
the voice has a waveform close to a sine curve. Accordingly, discrimination between
pressed sound and breathy sound is a significant speaking style, and if
represented as a numerical value, the degree thereof may possibly be
utilized as paralinguistic information.

A great deal of research has been reported on the acoustic cues
which differentiate breathiness from pressed voice quality. See, for
example, Reference 1 listed in the last part of the specification. The
majority of such studies, however, have been limited to speech (or singing)
data recorded during sustained phonation of steady-state vowels. It
indeed remains a challenge to quantify with high reliability the degree of
pressedness or breathiness from acoustic measurements in large amounts
of recorded speech data, and if realized, this would be very helpful.

While various measures have been proposed which approximate
properties of the voice source in the spectral domain, the most direct
estimates are obtained from a combination of the glottal flow waveform and
its derivative. An example of such an approximation is AQ, proposed in
Reference 2 listed in the last part of the specification.

One advantage of AQ is explained in Reference 2 to be its relative
independence of the sound pressure level (SPL) and its reliance primarily
on phonatory quality. Another possible advantage is that it is a purely
amplitude-domain parameter and should therefore be relatively immune to
the sources of error in measuring time-domain features of the estimated
glottal waveform. The authors of Reference 2 have found that, for all of
four male and four female speakers producing the sustained vowel "a" with
a range of phonation types, the value of AQ decreased as
phonation was changed from breathy to pressed (Reference 2, p. 136). AQ
therefore seems promising in our efforts to solve the problem discussed in
the foregoing. It is noted, however, that the following conditions must be
satisfied for AQ to be applied effectively:

1) AQ can be measured robustly and reliably in recorded natural
speech; and
2) the perceptual salience of the parameter as measured under such
conditions can be validated.

To satisfy such conditions, it is important to reliably extract,
from speech waveforms representative of physical quantities, such as
naturally produced voices, parameters representative of features of the
speech waveforms. In particular, speech may contain both portions from
which parameters can be extracted reliably and portions from which they
cannot, when the utterances are not
fully and closely controlled by the speaker or when various speakers give
utterances in various styles. Therefore, it is important to choose which
portion of the speech waveform is to be the object of processing. To this end, a
central portion of a syllable (tentatively referred to as "syllabic nuclei")
must be extracted correctly where a syllable serves as a unit of sound
production, as in the case of Japanese.

Disclosure of the Invention 

Therefore, an object of the present invention is to enable automatic
determination of a portion that reliably represents a feature of a speech
waveform. Another object of the present invention is to enable
determination of a portion that reliably represents a feature of a speech
waveform without operator supervision. A further object of the present
invention is to enable reliable automatic extraction of syllabic nuclei.

A first aspect of the present invention relates to an apparatus for
determining a portion reliably representing a feature of a speech waveform,
based on speech waveform data representing physical quantities, which can
be divided into a plurality of syllables, as well as to a program causing a
computer to operate as such an apparatus. The apparatus includes: an
extracting means for calculating, from the data, distribution of an energy of
a prescribed frequency range of the speech waveform on a time axis, and for
extracting, among various syllables of the speech waveform, a range that is
generated stably by a source of the speech waveform, based on the
distribution and pitch of said speech waveform; an estimating means for
calculating, from the data, distribution of spectrum of the speech waveform
on the time axis, and for estimating, based on the spectral distribution on
the time axis, a range of the speech waveform of which change is well
controlled by the source; and a means for determining that range which is
extracted by the extracting means as the range generated stably by the
source and of which speech waveform is estimated by the estimating means
to be well controlled by the source, as a highly reliable portion of the speech
waveform.

As the highly reliable portion of the speech waveform is determined
based both on the result of extraction by the extracting means and on the
result of estimation by the estimating means, the determined result is
highly robust.

The extracting means may include: a voiced/unvoiced determining
means for determining, based on the data, whether each segment of the
speech waveform is voiced or unvoiced; a means for separating the speech
waveform into syllables at a local minimum of the waveform of energy
distribution of the prescribed frequency range of the speech waveform on
the time axis; and a means for extracting that range of the speech
waveform which includes, in each syllable, an energy peak in that syllable
within the segment determined to be a voiced segment by the
voiced/unvoiced determining means and in which the energy of the
prescribed frequency range is not lower than a prescribed threshold value.

In a segment that is determined to be a voiced segment, a range of
which energy of the prescribed frequency range is not lower than the
prescribed threshold value is extracted. Therefore, a segment that is
produced stably by the speaker can reliably be extracted.

Preferably, the estimating means includes: a linear predicting means
for performing linear prediction analysis on the speech waveform and
outputting an estimated value of formant frequency; a first calculating
means for calculating, using the data, distribution of non-reliability of the
estimated value of formant frequency provided by the linear predicting
means on the time axis; a second calculating means for calculating, based
on an output from the linear predicting means, distribution on the time
axis of local variance of spectral change on the time axis of the speech
waveform; and means for estimating, based both on the distribution on the
time axis of non-reliability of the estimated value of formant frequency
calculated by the first calculating means and on the distribution on the
time axis of local variance of spectral change in the speech waveform
calculated by the second calculating means, a range in which change in the
speech waveform is well controlled by the source.

A range in which the change in speech waveform is well controlled
by the source is estimated based both on the non-reliability of the estimated
value of formant frequency and on the local variance of spectral change on
the time axis of the speech waveform. As the range in which vibration is
controlled with clear intent by the source of vibration (for example, the
speaker) is estimated, a feature of vibration calculated from such
a range is expected to have high reliability.

The determining means may include a means for determining, as a
highly reliable portion of the speech waveform, a range included in the
range extracted by the extracting means, within the range of which change
in speech waveform is estimated by the estimating means to be well
controlled by the source.

Among the ranges of which change in speech waveform is estimated
to be well controlled by the source, only the range in which the speech
waveform is stably generated by the source is determined to be the highly
reliable portion. Therefore, only the truly reliable portion can be extracted.

According to another aspect, the present invention relates to a quasi-syllabic
nuclei extracting apparatus for separating a speech signal into
quasi-syllables and extracting a nuclear portion of each quasi-syllable, and to a
program causing a computer to operate as such an apparatus. The
quasi-syllabic nuclei extracting apparatus includes: a voiced/unvoiced
determining means for determining whether each segment of the speech
signal is voiced or unvoiced; a means for separating the speech signal into
quasi-syllables at a local minimum of a time-distribution waveform of an
energy of a prescribed frequency range of the speech signal; and a means
for extracting that range of the speech signal which includes an energy peak in
each quasi-syllable, is determined by the voiced/unvoiced determining means
to be a voiced segment, and of which energy of the prescribed frequency
range is not lower than a prescribed threshold value, as the nuclei of the
quasi-syllable.

A range in the segment determined to be a voiced segment and
having the energy in the prescribed frequency range not lower than a
prescribed threshold value is extracted as the nuclei of the quasi-syllable,
so that the voice stably produced by the speaker can be extracted.

According to a still further aspect, the present invention relates to an
apparatus for determining a portion representing, with high reliability, a
feature of a speech signal, and to a program causing a computer to operate
as such an apparatus. The apparatus includes: a linear predicting means
for performing linear prediction analysis on the speech signal; a first
calculating means for calculating, based on an estimated value of formant
provided by the linear predicting means and on the speech signal,
distribution on the time axis of non-reliability of the formant estimated value; a
second calculating means for calculating, based on the result of linear
prediction analysis by the linear predicting means, distribution on the time
axis of local variance of spectral change in the speech signal; and a means
for estimating, based on the distribution on the time axis of the non-reliability
of the estimated value of formant frequency calculated by the first
calculating means, and on the distribution on the time axis of local variance of
spectral change in the speech waveform calculated by the second
calculating means, a range in which the change in speech waveform is well
controlled by the source.

Both the distribution on the time axis of the non-reliability of the formant
estimated value and the distribution on the time axis of local variance of
spectral change in the speech signal represent, at their local minima,
portions of which generation of speech waveform is well controlled by the
source, among the speech signals. As the range is estimated using these
two pieces of information, the portion at which generation of speech
waveform is well controlled can be identified with high reliability.

Brief Description of the Drawings

Fig. 1 shows an appearance of a computer system executing a
program in accordance with an embodiment of the present invention.

Fig. 2 is a block diagram of the computer system shown in Fig. 1.

Fig. 3 is a block diagram representing an overall configuration of the
program in accordance with an embodiment of the present invention.

Fig. 4 schematically shows speech data.

Fig. 5 is a block diagram of an acoustic/prosodic analysis unit 92
shown in Fig. 3.

Fig. 6 is a block diagram of a cepstral analysis unit 94 shown in Fig. 3.

Fig. 7 is a block diagram of a standardizing and integrating unit 144
shown in Fig. 6.

Fig. 8 is a block diagram of a formant optimizing unit 98 shown in
Fig. 3.

Fig. 9 is a block diagram of an AQ calculating unit 100.

Fig. 10 is an exemplary display given by the program in accordance
with an embodiment of the present invention.

Fig. 11 shows an estimated glottal flow waveform, an estimated
derivative of the glottal flow waveform, and a spectrum of the estimated
glottal flow waveform, of a point of speech data that is determined to be a
pressed sound.

Fig. 12 shows an estimated glottal flow waveform, an estimated
derivative of the glottal flow waveform, and a spectrum of the estimated
glottal flow waveform, of a point of speech data that is determined to be a
breathy sound.

Fig. 13 is a scatter plot representing a relation between the sensed
breathiness and acoustically measured AQ.

Best Modes for Carrying Out the Invention

Embodiments of the present invention that will be described in the
following are implemented by a computer and software running on the
computer. Needless to say, part or all of the functions described
below may be implemented by hardware rather than software.

[Word Definition 2]

Words used in the description of the embodiments will be defined.
A "pseudo-syllable" refers to a bounded segment of a signal
determined by a prescribed signal processing of the speech signal, which
may correspond to a syllable or syllables in the case of Japanese speech.
"Sonorant energy" refers to an energy of a prescribed frequency range (for
example, a frequency range of 60 Hz to 3 kHz) of the speech signal,
represented in decibels.
"Center of reliability" refers to a range that comes to be regarded as a
portion of the speech waveform from which the feature of the object
waveform can be extracted with high reliability, as a result of signal
processing of the speech waveform.
A "dip" refers to a constricted portion of a graph or figure.
Particularly, a dip refers to a portion that corresponds to a local minimum of
a waveform formed by a distribution on a time axis of values that vary as a
function of time.
"Unreliability" is a measure representing lack of reliability.
Unreliability is a concept opposite to reliability.

Fig. 1 shows an appearance of a computer system 20 used in the
present embodiment, and Fig. 2 is a block diagram of computer system 20.
It is noted that computer system 20 shown here is only an example and
various other configurations are available.

Referring to Fig. 1, computer system 20 includes a computer 40, and
a monitor 42, a keyboard 46, and a mouse 48 that are all connected to
computer 40. Further, computer 40 has a CD-ROM (Compact Disc Read-Only
Memory) drive 50 and an FD (Flexible Disk) drive 52 provided therein.

Referring to Fig. 2, computer system 20 further includes a printer 44
connected to computer 40, which is not shown in Fig. 1. Computer 40
further includes a bus 66 connected to CD-ROM drive 50 and FD drive 52,
and a CPU (Central Processing Unit) 56, a ROM (Read-Only Memory) 58
storing a boot-up program of the computer and the like, a RAM (Random
Access Memory) 60 providing a work area used by CPU 56 and a storage
area for a program executed by CPU 56, and a hard disk 54 storing the
speech database, which will be described later, all connected to bus 66.

The software that implements the system of the embodiment
described in the following is distributed recorded on a recording medium
such as a CD-ROM 62, read to computer 40 through a reading device such
as CD-ROM drive 50, and stored in hard disk 54. When CPU 56 executes
the program, the program is read from hard disk 54 and stored in RAM 60,
an instruction is read from an address designated by a program counter,
not shown, and the instruction is executed. CPU 56 reads the data as the
object of processing from hard disk 54, and stores the result of processing
also in hard disk 54.

As the operation of computer system 20 itself is well-known, detailed
description will not be given here.

As to the manner of software distribution, the software need not necessarily be
fixed on a recording medium. By way of example, the software may be
distributed from another computer connected through a network from
which data is received. A part of the software may be stored in hard disk
54, and the remaining part of the software may be taken through a network
to hard disk 54 and integrated at the time of execution.

Typically, a modern computer utilizes general functions provided by
the operating system (OS) of the computer, and executes the functions in an
organized manner in accordance with a desired object, to attain the object.
Therefore, it is obvious that a program or programs not including the
general functions provided by the OS or by a third party and designating
only a combination of execution orders of the general functions fall within
the scope of the present invention, as long as the program or programs have
the control structure that, as a whole, attains the desired object using such
combination.

The block diagrams of Fig. 3 and the following figures represent the
program of the present embodiment as an apparatus. Referring to Fig. 3,
the apparatus 80 performs the following processes on speech data 82 stored
in hard disk 54, to calculate and output AQ described above, for each
process unit (by way of example, for each syllable) included in the speech
data. As will be described later, the speech data is divided in advance into
frames, each of 32 msec.
Apparatus 80 includes: an FFT processing unit 90 performing Fast
Fourier Transform (FFT) on the speech data; an acoustic/prosodic analysis
unit 92 using an output from FFT processing unit 90, for extracting a range
that is produced stably (hereinafter referred to as "pseudo-syllabic nuclei")
by the vocal apparatus of a speaker from various syllables of the speech
waveform given by the speech data, based on time-change in energy in the
frequency range of 60 Hz to 3 kHz of the speech waveform given by the
speech data and on the change in speech pitch; and a cepstral analysis unit
94 performing cepstral analysis on speech data 82 and estimating a portion
that has small variation in speech spectrum and from which the feature of
speech data is believed to be extracted with high reliability (hereinafter
this portion will be referred to as a "center of a portion of high reliability
and small variation", a "center of high reliability and small variation", or
simply as a "center of reliability"), as a result of cepstral analysis using an
output from FFT processing unit 90.

Apparatus 80 further includes: a pseudo-syllabic center extracting
unit 96 extracting, as a pseudo-syllabic center, only that one of the centers
of portions of high reliability and small variation output from cepstral
analysis unit 94 which is in the pseudo-syllabic nuclei output from
acoustic/prosodic analysis unit 92; a formant optimizing unit 98 performing
initial estimation and optimization of formant on the speech data
corresponding to the pseudo-syllabic center extracted by pseudo-syllabic
center extracting unit 96, for outputting a final estimation of formant; and
an AQ calculating unit 100 estimating a derivative of the glottal flow
waveform by performing a signal processing such as adaptive filtering
using the formant value output from formant optimizing unit 98,
estimating the glottal flow waveform by integrating the resulting
estimation, and calculating AQ therefrom.
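
By way of illustration only, the selection performed by pseudo-syllabic center extracting unit 96 may be sketched as follows. The function name, the use of frame indices, and the representation of each pseudo-syllabic nucleus as an inclusive (first, last) pair are assumptions made for this sketch and are not taken from the embodiment.

```python
def select_pseudo_syllabic_centers(center_candidates, nuclei):
    """Keep only the reliability-center candidates lying inside a nucleus.

    center_candidates: frame indices of candidate centers of reliability
                       (hypothetical representation).
    nuclei: list of (first_frame, last_frame) pairs, both inclusive,
            one pair per pseudo-syllabic nucleus (hypothetical representation).
    """
    return [
        c for c in center_candidates
        if any(first <= c <= last for first, last in nuclei)
    ]
```

A candidate falling outside every nucleus is simply discarded, which corresponds to keeping only the range on which both analysis strands agree.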

Fig. 4 schematically shows the speech data. Referring to Fig. 4, a
speech data waveform 102 is divided into frames each of 32 msec and
shifted by 8 msec from preceding and succeeding frames, and digitized.
The process described in the following proceeds such that at a time point t0,
the process starts from the first frame as a head frame, and at a next time point
t1, the process starts from the next, second frame as a head frame, which is
delayed by 8 msec.
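
The framing described above may be sketched, under stated assumptions, as follows. The 32 msec frame length and 8 msec shift are taken from the description; the function name, the sample_rate parameter, and the choice to drop a trailing partial frame are assumptions of this sketch.

```python
def split_into_frames(samples, sample_rate, frame_ms=32, shift_ms=8):
    """Split a sample sequence into overlapping frames.

    frame_ms and shift_ms default to the 32 msec / 8 msec values in the
    description; sample_rate (samples per second) is an assumed input.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    frames = []
    start = 0
    # Only complete frames are kept; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

With these values, consecutive frames overlap by 24 msec, so each sample away from the edges of the utterance belongs to four frames.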

Fig. 5 is a block diagram of acoustic/prosodic analysis unit 92 shown
in Fig. 3. Referring to Fig. 5, acoustic/prosodic analysis unit 92 includes: a
pitch determining unit 110 for determining whether an object frame is a
voiced or unvoiced segment, using the pitch of the source measured from
the speech waveform (as to the method of determination, see Reference 3); a
sonorant energy calculating unit 112 for calculating waveform distribution
of sonorant energy in a prescribed frequency range (60 Hz to 3 kHz) on a
time axis, based on the output from FFT processing unit 90; a dip detecting
unit 114 for applying a convex-hull algorithm on a contour of distribution
waveform of the sonorant energy on the time axis calculated by sonorant
energy calculating unit 112, and for detecting a dip of the contour of
distribution waveform of the sonorant energy on the time axis, so as to divide
the input speech into pseudo-syllables (as to the specific method, see
References 4 and 5); and a voiced/energy determining unit 116 for locating
a point attaining maximum sonorant energy (SEpeak) and for expanding,
one by one, frames on left and right sides of the peak which have the
sonorant energy higher than a prescribed threshold value (0.8 x SEpeak)
and which are determined by pitch determining unit 110 to be voiced
segments belonging to the same pseudo-syllable, to output the
pseudo-syllabic nuclei.
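
The expansion performed by voiced/energy determining unit 116 may be sketched as follows. The 0.8 factor is from the description; the function name, the per-frame list representation, the half-open segment bounds, and the use of "not lower than" for the threshold comparison (as worded in the disclosure of the invention) are assumptions of this sketch. The sketch also assumes positive energy values, so that 0.8 x SEpeak lies below the peak.

```python
def extract_nucleus(energy, voiced, start, end, ratio=0.8):
    """Return (first, last) frame indices of the nucleus of one pseudo-syllable.

    energy: sonorant energy per frame (assumed positive).
    voiced: per-frame booleans from a voiced/unvoiced decision.
    start, end: half-open frame range [start, end) of the pseudo-syllable.
    Returns None when the peak frame itself is unvoiced.
    """
    peak = max(range(start, end), key=lambda i: energy[i])
    if not voiced[peak]:
        return None
    threshold = ratio * energy[peak]

    def qualifies(i):
        return voiced[i] and energy[i] >= threshold

    # Grow outward from the peak one frame at a time, stopping at the first
    # frame that is unvoiced, below threshold, or outside the pseudo-syllable.
    left = peak
    while left - 1 >= start and qualifies(left - 1):
        left -= 1
    right = peak
    while right + 1 < end and qualifies(right + 1):
        right += 1
    return left, right
```

Because the expansion stops at the first failing frame on each side, the nucleus is always a single contiguous run containing the energy peak.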
Fig. 6 is a block diagram of cepstral analysis unit 94 shown in Fig. 3.
Referring to Fig. 6, cepstral analysis unit 94 includes: a linear prediction
analysis unit 130 for performing selective linear prediction (SLP) analysis
on the speech waveform of speech data 82 and for outputting SLP cepstral
coefficients; and a formant estimating unit 132 for calculating initial
estimations of frequency and bandwidth of the first four formants based on the
cepstral coefficients. Formant estimating unit 132 has learned a mapping
for vowel formants measured carefully using the same data subset,
utilizing the linear cepstrum-formant mapping proposed in Reference 6.
For this learning, see Reference 7.

Cepstral analysis unit 94 further includes: a cepstrum re-generating
unit 136 for re-calculating cepstral coefficients based on the
estimated formant frequency and the like; a logarithmic transformation
and inverse discrete cosine transformation unit 140 for performing
logarithmic transformation and inverse discrete cosine transformation on
the output of FFT processing unit 90 and for calculating FFT cepstral
coefficients; and a cepstral distance calculating unit 142 calculating a
cepstral distance d defined by the following equation, representing
differences between cepstral coefficients calculated by cepstrum
re-generating unit 136 and FFT cepstral coefficients calculated by
logarithmic transformation and inverse discrete cosine transformation unit
140, and outputting the same as an index representing unreliability of the
value of formant frequency estimated by formant estimating unit 132:
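
The exact form of the distance is not reproduced here. Purely as an illustration, and assuming a squared Euclidean distance between the two sets of cepstral coefficients (a common choice, not confirmed as the form used in the embodiment), the calculation may be sketched as follows; the function and parameter names are hypothetical.

```python
def cepstral_distance(regenerated, fft_cepstrum):
    """Squared Euclidean distance between two cepstral coefficient vectors.

    regenerated: coefficients re-calculated from the estimated formants.
    fft_cepstrum: coefficients obtained from the FFT output.
    The squared Euclidean form is an assumption of this sketch.
    """
    if len(regenerated) != len(fft_cepstrum):
        raise ValueError("coefficient vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(regenerated, fft_cepstrum))
```

Under this assumption a distance of zero means the formant-based cepstrum reproduces the FFT cepstrum exactly, and larger values indicate a less trustworthy formant estimate.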

Formant estimating unit 132, cepstrum re-generating unit 136,
cepstral distance calculating unit 142 and logarithmic transformation and
inverse discrete cosine transformation unit 140 calculate unreliability of
values such as the formant frequency estimated based on the result of
linear prediction analysis.

Cepstral analysis unit 94 further includes: a Δ cepstrum calculating
unit 134 for calculating Δ cepstrum from the cepstral coefficients output
from linear prediction analysis unit 130; and an inter-frame variance
calculating unit 138 calculating, for every frame, variance in magnitude of
spectral change among five frames including the frame of interest. An
output of inter-frame variance calculating unit 138 represents a contour of
distribution waveform on the time axis of local spectral movement, of which
local minimum is considered to represent controlled movement (CM) in
accordance with the theory of articulatory phonetics proposed in Reference 

Cepstral analysis unit 94 further includes: a standardizing and
integrating unit 144 for receiving the value representative of unreliability
of estimated formant frequency output from cepstral distance calculating
unit 142 and a local inter-frame variance output from inter-frame variance
calculating unit 138, and for standardizing and integrating these values to
output the result as a distribution waveform on the time axis of the value
representing the unreliability of speech signal frame by frame; and a
reliability center candidate output unit 146 for detecting a dip in a
waveform contour formed by the distribution waveform on the time axis of
the unreliability value output by standardizing and integrating unit 144
using a convex-hull algorithm, and outputting the same as a reliability center
candidate.
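
The local inter-frame variance supplied by inter-frame variance calculating unit 138 may be sketched as follows. The five-frame window is from the description; taking the magnitude of spectral change as the Euclidean norm of a simple first difference of the cepstral vectors, and shrinking the window at the edges of the utterance, are assumptions of this sketch, as are the function names.

```python
from statistics import pvariance


def spectral_change(cepstra):
    """Magnitude of cepstral change from the preceding frame, per frame.

    cepstra: list of per-frame coefficient lists. The first frame has no
    predecessor and is assigned 0.0 (an assumption of this sketch).
    """
    changes = [0.0]
    for prev, cur in zip(cepstra, cepstra[1:]):
        changes.append(sum((c - p) ** 2 for p, c in zip(prev, cur)) ** 0.5)
    return changes


def local_variance(values, window=5):
    """Variance of values over a window centered on each frame.

    The window is truncated at the ends of the sequence.
    """
    half = window // 2
    result = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        result.append(pvariance(values[lo:hi]))
    return result
```

A low value of the local variance marks frames around which the spectrum is moving steadily or not at all, which is the condition the description associates with controlled movement.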

Fig. 7 is a block diagram of standardizing and integrating unit 144
shown in Fig. 6. Referring to Fig. 7, standardizing and integrating unit
144 includes: a first standardizing unit 160 for standardizing the cepstral
distance output from cepstral distance calculating unit 142 to the values in
[0, 1]; a second standardizing unit 162 for standardizing the inter-frame
variance calculated for each frame by inter-frame variance calculating unit
138 to the values in [0, 1]; an interpolating unit 164 for performing a linear
interpolating process such that the positions on the time axis of local
inter-frame variances match sampling timings of cepstral distance output from
cepstral distance calculating unit 142; and an average calculating unit 166
outputting an average of the outputs from first standardizing unit 160 and
interpolating unit 164 frame by frame. An output of average calculating
unit 166 represents a contour of distribution waveform on the time axis of
the integrated value. By detecting a dip (local minimum) of the waveform
contour by reliability center candidate output unit 146, the portion of
lowest unreliability (highest reliability) can be specified as the candidate of
reliability center.
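
The standardizing, interpolating and averaging steps may be sketched as follows. Min-max scaling is assumed as the standardization to [0, 1], and a plain local-minimum test stands in for the convex-hull dip detection of the embodiment; both are simplifications chosen for this sketch, and all function names are hypothetical.

```python
def standardize(values):
    """Min-max scale values to [0, 1] (assumed form of standardization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def interpolate(times, values, target_times):
    """Linearly interpolate (times, values) at target_times.

    times must be ascending; targets outside the range take the end values.
    """
    result = []
    for t in target_times:
        if t <= times[0]:
            result.append(values[0])
        elif t >= times[-1]:
            result.append(values[-1])
        else:
            for k in range(1, len(times)):
                if t <= times[k]:
                    w = (t - times[k - 1]) / (times[k] - times[k - 1])
                    result.append(values[k - 1] + w * (values[k] - values[k - 1]))
                    break
    return result


def integrated_unreliability(dist, dist_times, var, var_times):
    """Average of the standardized distance and the standardized variance,
    with the variance resampled at the timings of the distance."""
    d = standardize(dist)
    v = interpolate(var_times, standardize(var), dist_times)
    return [(a + b) / 2 for a, b in zip(d, v)]


def local_minima(curve):
    """Indices of interior local minima (a stand-in for dip detection)."""
    return [
        i for i in range(1, len(curve) - 1)
        if curve[i] < curve[i - 1] and curve[i] <= curve[i + 1]
    ]
```

Averaging the two standardized measures gives each equal weight, so a frame is a candidate only when neither measure is markedly high relative to its neighbors.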

Fig. 8 is a block diagram of formant optimizing unit 98 shown in Fig.
3. Referring to Fig. 8, formant optimizing unit 98 includes: an FFT
processing unit 180 for performing FFT on the speech waveform; a
logarithmic transformation and inverse DCT unit 182 for performing
logarithmic transformation and inverse discrete cosine transformation on
the output of FFT processing unit 180; a cepstral distance calculating unit
184 calculating a cepstral distance between the FFT cepstral coefficients
output from logarithmic transformation and inverse DCT unit 182 and an
estimated formant value as will be described later; and a distance
minimizing unit 186 for optimizing the estimated formant value by a
hill-climbing method such that the distance calculated by cepstral distance
calculating unit 184 is minimized, using initial estimates of first to fourth
formant frequencies in each of the reliability center candidates as initial
values. The estimated formant value optimized by distance minimizing
unit 186 is applied to AQ calculating unit 100 as an output of formant
optimizing unit 98.
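
A hill-climbing minimization of the kind performed by distance minimizing unit 186 may be sketched as follows. The description does not specify the variant used; this sketch assumes a coordinate-wise search with step halving. The distance argument is a caller-supplied function standing in for the cepstral distance of unit 184, and the function name and the step and min_step parameters are hypothetical.

```python
def hill_climb(initial, distance, step=10.0, min_step=0.1):
    """Minimize distance(params) by coordinate-wise hill climbing.

    initial: starting parameter values (for example, initial formant
             frequency estimates).
    distance: function mapping a parameter list to a non-negative number.
    step, min_step: initial and smallest step sizes (assumed values).
    Returns (best_params, best_distance).
    """
    current = list(initial)
    best = distance(current)
    while step >= min_step:
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                trial = list(current)
                trial[i] += delta
                d = distance(trial)
                if d < best:
                    current, best, improved = trial, d, True
        if not improved:
            # No neighbor is better at this resolution; refine the step.
            step /= 2
    return current, best
```

Like any hill-climbing search, this finds a local minimum near the starting point, which is why the quality of the initial formant estimates matters.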

Referring to Fig. 9, AQ calculating unit 100 includes: a high pass
filter 200 selectively passing only the frequency component of 70 Hz or
higher of the 64 msec portion at a position corresponding to the syllabic
center of the speech signal; an adaptive low pass filter 202 selectively
passing only the frequency component that is not higher than the sum of
the optimized fourth formant frequency and its bandwidth, from the outputs of
high pass filter 200; and an adaptive inverse filter 204 for performing
adaptive inverse filtering using first to fourth formant frequencies on the
outputs of adaptive low pass filter 202. The output of adaptive inverse
filter 204 will be the derivative waveform of the glottal flow waveform.

AQ calculating unit 100 further includes: an integrating circuit 206
integrating the outputs of adaptive inverse filter 204 and outputting the
glottal flow waveform; a maximum peak-to-peak amplitude detecting
circuit 208 for detecting maximum peak-to-peak amplitude of the output of
integrating circuit 206; a lowest negative peak amplitude detecting circuit
210 for detecting maximum amplitude of a negative peak of the output of
adaptive inverse filter 204; and a ratio calculating circuit 212 for
calculating a ratio of the output of maximum peak-to-peak amplitude
detecting circuit 208 to the output of lowest negative peak amplitude
detecting circuit 210. The output of ratio calculating circuit 212 is AQ.

The apparatus described above operates in the following manner.
First, the speech data 82 used will be described. The speech data is the
one used in Reference 9, which was prepared by recording three stories
read by a female native speaker of Japanese. These stories were prepared
to evoke the emotions of anger, joy and sadness. Each story contained
more than 400 sentence-length utterances (or more than 30,000 phonemes).
These utterances are stored in separate speech wave files for independent
processing.

Each sentence-length utterance data is subjected to FFT processing
by FFT processing unit 90, and thereafter subjected to the following
processes, which proceed along two main strands. One is acoustic-prosodic
processing performed by acoustic/prosodic analysis unit 92, and the other
is acoustic-phonetic processing performed by cepstral analysis unit 94.

In the acoustic-prosodic strand, sonorant energy in the frequency
range of 60 Hz to 3 kHz is calculated by sonorant energy calculating unit
112 shown in Fig. 5. From the contour of the entire waveform of utterance
data of one sentence output from sonorant energy calculating unit 112, dip
detecting unit 114 detects dips by applying the convex hull algorithm. By
the dips, quasi-syllabic segmentation of the utterance is obtained.
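
The dip detection above may be easier to follow with a small sketch. The
following is only an illustration under stated assumptions, not the
implementation of dip detecting unit 114: the "convex hull" is read here as
the smallest envelope that rises monotonically to the energy peak of a
segment and falls monotonically after it, and the names unimodal_hull,
find_dips and the threshold min_depth are hypothetical.

```python
import numpy as np

def unimodal_hull(seg):
    """Smallest envelope above seg that rises to the peak, then falls."""
    peak = int(np.argmax(seg))
    rising = np.maximum.accumulate(seg[:peak + 1])
    falling = np.maximum.accumulate(seg[peak:][::-1])[::-1]
    return np.concatenate([rising, falling[1:]])

def find_dips(energy, lo=0, hi=None, min_depth=0.5):
    """Recursively split [lo, hi) at the deepest hull-to-contour gap.

    min_depth is an assumed threshold (same unit as energy).
    Returns the sorted frame indices of the detected dips.
    """
    energy = np.asarray(energy, dtype=float)
    if hi is None:
        hi = len(energy)
    if hi - lo < 3:            # too short to contain an interior dip
        return []
    seg = energy[lo:hi]
    depth = unimodal_hull(seg) - seg
    d = int(np.argmax(depth))
    if depth[d] <= min_depth:  # no dip deep enough in this segment
        return []
    cut = lo + d
    return (find_dips(energy, lo, cut, min_depth)
            + [cut]
            + find_dips(energy, cut, hi, min_depth))
```

Each returned index would then serve as a boundary between adjacent
quasi-syllables.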

Voiced/energy determining unit 116 finds a point (SEpeak) having the
maximum sonorant energy within each quasi-syllable. This point is the
initial point of the quasi-syllabic nucleus. Further, voiced/energy
determining unit 116 extends the range of the quasi-syllabic nucleus,
starting from the initial point and frame by frame both to the left and to
the right, until a frame of which sonorant energy is not higher than
0.8 x SEpeak, a frame determined by pitch determining unit 110 to be not
voiced, or a frame outside the quasi-syllable is encountered. In this
manner, the boundaries of quasi-syllabic nuclei are determined, of which
information is applied to pseudo-syllabic center extracting unit 96.
Though the value 0.8 is used here as the threshold, it is a mere example,
and the value must be changed appropriately dependent on application.
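
The frame-by-frame extension just described can be sketched as follows.
This is a minimal illustration, assuming per-frame sonorant energies and
voicing decisions are already available as sequences; the function name
extend_nucleus and its parameters are hypothetical.

```python
def extend_nucleus(energy, voiced, seg_start, seg_end, ratio=0.8):
    """Grow a nucleus around the energy peak of one quasi-syllable.

    energy, voiced: per-frame sonorant energy and voicing flags.
    [seg_start, seg_end): frame range of the quasi-syllable.
    ratio: threshold factor (0.8 in the text, stated to be an example).
    Returns (left, right), the inclusive frame range of the nucleus.
    """
    peak = max(range(seg_start, seg_end), key=lambda i: energy[i])
    floor = ratio * energy[peak]

    def usable(i):
        # Stop at the segment edge, at an unvoiced frame, or at a frame
        # whose energy is not higher than ratio * SEpeak.
        return seg_start <= i < seg_end and voiced[i] and energy[i] > floor

    left = right = peak
    while usable(left - 1):
        left -= 1
    while usable(right + 1):
        right += 1
    return left, right
```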

Referring to Fig. 6, for one input utterance, linear prediction analysis
unit 130 performs linear prediction analysis, and outputs SLP cepstral
coefficients. Based on the SLP cepstral coefficients, Δ cepstrum
calculating unit 134 calculates the Δ cepstrum, which is applied to
inter-frame variance calculating unit 138. Based on the Δ cepstral
coefficients, inter-frame variance calculating unit 138 calculates, for
each frame, the variance of local spectral variation in five frames
including the frame of interest. It is considered that the smaller the
variance, the better controlled the utterance by the speaker, and the
larger the variance, the more poorly controlled the utterance by the
speaker. Therefore, the output of inter-frame variance calculating unit
138 is believed to represent how unreliable the utterance by the speaker
is (that is, its unreliability).
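
One possible reading of this local variance measure is sketched below. The
text does not give the exact formula, so the sketch makes two assumptions:
the delta (Δ) cepstrum is taken as the first difference between adjacent
frames, and the "local spectral variation" of a frame is the Euclidean
norm of its delta cepstrum. The names delta_cepstrum and local_variance
are hypothetical, and at least two frames are assumed.

```python
import numpy as np

def delta_cepstrum(cep):
    """First difference along time; the first row is repeated so that
    the output has one row per input frame. cep: (frames, coeffs)."""
    cep = np.asarray(cep, dtype=float)
    d = np.diff(cep, axis=0)
    return np.vstack([d[:1], d])

def local_variance(cep, width=5):
    """Variance of the delta-cepstrum magnitude over a window of
    `width` frames centered on each frame (edges are replicated)."""
    mag = np.linalg.norm(delta_cepstrum(cep), axis=1)
    half = width // 2
    padded = np.pad(mag, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return windows.var(axis=1)
```

A steadily drifting spectrum gives a variance near zero, while an abrupt
spectral jump raises the variance of the frames around it.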

Further referring to Fig. 6, formant estimating unit 132 estimates
frequencies and bandwidths of the first to fourth formants, based on the
SLP cepstral coefficients, using linear cepstral formant mapping.
Cepstrum re-generating unit 136 calculates the cepstral coefficients in an
inverse manner based on the first to fourth formants estimated by formant
estimating unit 132, and applies the same to cepstral distance calculating
unit 142. Logarithmic transformation and inverse discrete cosine
transformation unit 140 performs logarithmic transformation and inverse
discrete cosine transformation on the original speech data of the same
frame as that processed by formant estimating unit 132 and cepstrum
re-generating unit 136 to obtain FFT cepstral coefficients, which are
applied to cepstral distance calculating unit 142. Cepstral distance
calculating unit 142 calculates the distance between the cepstral
coefficients from cepstrum re-generating unit 136 and the cepstral
coefficients from logarithmic transformation and inverse discrete cosine
transformation unit 140 in accordance with equation (1) above. The result
is considered to be a waveform representing a distribution on the time
axis of values indicating unreliability of the formants estimated by
formant estimating unit 132. Cepstral distance calculating unit 142
applies the result to standardizing and integrating unit 144.

Referring to Fig. 7, a first standardizing unit 160 of standardizing
and integrating unit 144 normalizes the value of unreliability of each
frame calculated from the estimated formant value output from cepstral
distance calculating unit 142 within the range of [0, 1], and applies the
result to an average calculating unit 166. A second standardizing unit
162 normalizes the value of local inter-frame variance calculated frame by
frame and output by inter-frame variance calculating unit 138 shown in
Fig. 6 within the range of [0, 1], and applies the result to interpolating
unit 164. Interpolating unit 164 performs linear interpolation on each
value from second standardizing unit 162 to obtain a value that
corresponds to the sampling point of each frame output from first
standardizing unit 160, and applies the result to average calculating unit
166. Average calculating unit 166 averages the outputs from first
standardizing unit 160 and from interpolating unit 164 frame by frame, and
outputs the result to reliability center candidate output unit 146 as an
integrated waveform representing the distribution of unreliability on the
time axis.
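
The standardize, interpolate and average steps can be illustrated roughly
as follows. This is a sketch only: min-max scaling is assumed as the way
of normalizing into [0, 1], an equal-weight mean is assumed for the
averaging, and the names to_unit_range and integrate_unreliability are
hypothetical.

```python
import numpy as np

def to_unit_range(x):
    """Min-max scale a sequence into [0, 1] (all zeros if constant)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def integrate_unreliability(dist, dist_t, var, var_t):
    """Combine two unreliability contours sampled at different times.

    dist, dist_t: formant/FFT cepstral distances and their time points.
    var, var_t:   local inter-frame variances and their time points
                  (var_t must be increasing).
    Returns one value per time point in dist_t.
    """
    a = to_unit_range(dist)
    # Linear interpolation of the second contour at the first one's times.
    b = np.interp(dist_t, var_t, to_unit_range(var))
    return (a + b) / 2.0
```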

Reliability center candidate output unit 146 detects the dips of the
contour of the integrated waveform output from standardizing and
integrating unit 144 in accordance with the convex-hull algorithm, and
outputs information specifying the corresponding frames as candidates of
reliability centers to pseudo-syllabic center extracting unit 96 shown in
Fig. 3.

Pseudo-syllabic center extracting unit 96 shown in Fig. 3 extracts,
from the centers of reliability applied from reliability center candidate
output unit 146 shown in Fig. 6, only those that are within the
pseudo-syllabic nuclei applied from acoustic/prosodic analysis unit 92.

Through the processes described above, information is obtained that
specifies a range of the speech data having high reliability and small
variation, suitable for extracting features of the speech data or for
labeling the speech data. Therefore, a desired processing may be
performed on the frame specified by the information. In the apparatus in
accordance with the present embodiment, pseudo-syllabic center extracting
unit 96 applies this information to formant optimizing unit 98, and
formant optimizing unit 98 and AQ calculating unit 100 calculate AQ at the
pseudo-syllabic center in the following manner, using this information.

In the apparatus of the present embodiment, the length of pseudo- 
syllabic center is determined to be five successive frames. Duration of one 
frame is 32 msec, and successive frames are delayed by 8 msec from each 
other, and therefore, duration of five frames in total corresponds to a 
speech period of 64 msec. 
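
The 64 msec figure follows from the frame length plus four frame shifts
(32 + 4 x 8). A one-line check, using a hypothetical helper span_msec:

```python
def span_msec(n_frames, frame_msec=32, shift_msec=8):
    """Total time covered by n overlapping frames, in milliseconds."""
    return frame_msec + (n_frames - 1) * shift_msec

# Five 32 msec frames shifted by 8 msec each cover 32 + 4 * 8 msec.
total = span_msec(5)
```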

AQ at each of these quasi-syllabic centers can be calculated directly
from the glottal flow waveform obtained by AQ calculating unit 100 shown
in Fig. 9. The estimate of glottal flow itself, however, is influenced by
the vocal tract resonance that corresponds to the original formants, and
therefore, reliability thereof depends on whether the influence of
resonance can be removed from the data of 64 msec of speech waveform.
Therefore, AQ obtained through such direct calculation is unreliable
unless the estimated formant values are first optimized.

Specifically, referring to Fig. 8, FFT processing unit 180 performs
FFT processing on every frame of the speech waveform. Logarithmic
transformation and inverse DCT unit 182 performs logarithmic
transformation and inverse discrete cosine transformation on the output of
FFT processing unit 180. Cepstral distance calculating unit 184
calculates the distance between the cepstral coefficients output from
logarithmic transformation and inverse DCT unit 182 and estimated cepstral
coefficients applied from distance minimizing unit 186. Distance
minimizing unit 186 optimizes the estimated cepstral coefficients that it
applies to cepstral distance calculating unit 184 such that the distance
calculated by cepstral distance calculating unit 184 is minimized, in
accordance with the hill-climbing method, starting from the cepstral
coefficients corresponding to the initial estimated formant values, and
outputs the estimated formant values at which the minimum is attained.
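
A generic hill-climbing loop of this kind might look like the sketch
below. It is not the implementation of distance minimizing unit 186.
Several parts are assumptions: the formant-to-cepstrum mapping uses the
standard cepstrum of a cascade of all-pole resonators, a plain squared
Euclidean distance stands in for equation (1), only the formant
frequencies (not the bandwidths) are adjusted, and all function names,
step sizes and the iteration cap are hypothetical.

```python
import numpy as np

def formants_to_cepstrum(freqs, bws, fs, n_coef=20):
    """Cepstral coefficients c_1..c_n of an all-pole resonator cascade:
    c_n = (2/n) * sum_k exp(-pi*n*B_k/fs) * cos(2*pi*n*F_k/fs)."""
    n = np.arange(1, n_coef + 1)[:, None]
    f = np.asarray(freqs, dtype=float)[None, :]
    b = np.asarray(bws, dtype=float)[None, :]
    terms = np.exp(-np.pi * n * b / fs) * np.cos(2 * np.pi * n * f / fs)
    return (2.0 / n[:, 0]) * terms.sum(axis=1)

def cepstral_distance(c1, c2):
    """Squared Euclidean distance (a stand-in for equation (1))."""
    return float(np.sum((np.asarray(c1) - np.asarray(c2)) ** 2))

def hill_climb(target_cep, freqs, bws, fs,
               step=50.0, min_step=1.0, max_iter=1000):
    """Move one formant frequency at a time by +/- step, keep any move
    that lowers the distance, and halve the step when nothing helps."""
    freqs = np.array(freqs, dtype=float)
    n_coef = len(target_cep)
    best = cepstral_distance(
        formants_to_cepstrum(freqs, bws, fs, n_coef), target_cep)
    for _ in range(max_iter):
        if step < min_step:
            break
        improved = False
        for k in range(len(freqs)):
            for delta in (step, -step):
                trial = freqs.copy()
                trial[k] += delta
                d = cepstral_distance(
                    formants_to_cepstrum(trial, bws, fs, n_coef),
                    target_cep)
                if d < best:
                    freqs, best, improved = trial, d, True
        if not improved:
            step /= 2.0
    return freqs, best
```

Hill climbing only finds a local minimum, which is why the quality of the
initial formant estimates matters.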

The internal configuration of AQ calculating unit 100 is shown in Fig. 9.
Referring to Fig. 9, the speech data of the quasi-syllabic center is first
passed through high pass filter 200, and as a result, low-frequency noise
below 70 Hz is removed. Thereafter, by adaptive low pass filter 202,
spectral information of a frequency range higher than the fourth formant
is removed. Then, by adaptive inverse filter 204, the influence of the
first to fourth formants is removed.

As a result, the output of adaptive inverse filter 204 becomes a good
estimate of the derivative of the glottal flow waveform. By integrating
this output by integrating circuit 206, an estimated value of the glottal
flow waveform can be obtained. Maximum peak-to-peak amplitude detecting
circuit 208 detects the maximum peak-to-peak amplitude of the glottal flow.
Lowest negative peak amplitude detecting circuit 210 detects the maximum
negative amplitude within the cycle of the derivative waveform of the
glottal flow. The ratio of the output of maximum peak-to-peak amplitude
detecting circuit 208 to the output of lowest negative peak amplitude
detecting circuit 210 is calculated by ratio calculating circuit 212,
whereby AQ at the quasi-syllabic center can be obtained.
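
The last three stages (integration, peak detection and ratio) can be
sketched as follows. This is an illustration only: it assumes the inverse
filtered signal (the flow derivative) is already available as an array,
integrates with a simple running sum, and measures the peaks over the
whole analyzed portion rather than cycle by cycle. The name
amplitude_quotient is hypothetical.

```python
import numpy as np

def amplitude_quotient(flow_derivative, fs):
    """AQ = peak-to-peak flow amplitude / |minimum of flow derivative|.

    flow_derivative: estimated derivative of the glottal flow.
    fs: sampling rate in Hz. The result has the unit of seconds.
    """
    d = np.asarray(flow_derivative, dtype=float)
    flow = np.cumsum(d) / fs              # running-sum integration
    peak_to_peak = flow.max() - flow.min()
    negative_peak = -d.min()              # magnitude of lowest negative peak
    if negative_peak <= 0:
        raise ValueError("flow derivative has no negative peak")
    return peak_to_peak / negative_peak
```

For a purely sinusoidal flow of frequency f0 the value approaches
1 / (pi * f0), which gives a quick sanity check of the computation.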

AQ obtained in this manner represents with high reliability the
feature (degree of pressed-breathy sound) of the original speech data at
each quasi-syllabic center. By calculating AQs for the quasi-syllabic
centers and by interpolating the thus obtained AQs, it becomes possible to
estimate AQ of a portion other than the quasi-syllabic centers.
Accordingly, when an appropriate label corresponding to a prescribed AQ is
attached as para-linguistic information to a portion of speech data that
represents the prescribed AQ, and when the speech data having a desired
AQ is used at the time of voice synthesis, speech synthesis including not
only the text but also the para-linguistic information can be attained.
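
As a rough illustration of estimating AQ between the centers, the sketch
below uses linear interpolation. The interpolation method is not specified
in the text, so linear interpolation and the name interpolate_aq are
assumptions.

```python
import numpy as np

def interpolate_aq(center_times, center_aq, query_times):
    """Linearly interpolate AQ values measured at the centers.

    Query times outside the measured range take the nearest end value.
    """
    t = np.asarray(center_times, dtype=float)
    aq = np.asarray(center_aq, dtype=float)
    order = np.argsort(t)                 # np.interp needs increasing x
    return np.interp(query_times, t[order], aq[order])
```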

Figs. 10 to 12 are exemplary displays that appear when the 
apparatus of the present embodiment is implemented by a computer. 

Referring to Fig. 10, in accordance with the program, the display
window displays: speech data 240; speech label 242 attached to the speech
data; a contour 244 of the distribution waveform on the time axis of
reference frequency waveform; a contour 246 of the distribution waveform
on the time axis of the sonorant energy variation; a contour 248 of the
distribution waveform on the time axis of local variance in spectral
variation calculated from the Δ cepstrum; a contour 250 of the
distribution waveform on the time axis of the formant-FFT cepstral
distance; a contour 252 of the distribution waveform on the time axis of
unreliability, which is a waveform obtained by integrating the contour 248
of the distribution waveform on the time axis of local variance in
spectral variation and the contour 250 of the distribution waveform on the
time axis of the formant-FFT cepstral distance; AQs of the glottis at the
pseudo-syllabic centers calculated in the above-described manner; and an
area function of the vocal tract estimated at each pseudo-syllabic center.

The thick vertical lines 232 appearing in the display area of speech
data waveform 240 and the thick vertical lines appearing in the display
area of sonorant energy variation contour 246 represent boundaries of
quasi-syllables. Thin vertical lines 230 appearing in the display area of
speech data waveform 240 and thin vertical lines appearing in the display
areas of sonorant energy variation contour 246 and reference frequency
waveform contour 244 represent boundaries of pseudo-syllabic nuclei.

Vertical lines appearing in the display area of unreliability
waveform 252 represent local minima (dips) of the waveform, and the
portion for which AQ is calculated with each dip at its center is the
portion of highest reliability. The period of calculation and the value
of each AQ are represented by a horizontal bar; the higher the vertical
position of the horizontal bar, the closer the sound is to the pressed
sound, and the lower the position, the closer to the breathy sound.

Fig. 11 shows the estimated glottal flow waveform 270, derivative
272 thereof, and spectrum 274 of the estimated glottal flow waveform, at
the time point indicated by a dotted box 262 on the left side of Fig. 10.
At the time point corresponding to box 262 of Fig. 10, AQ 254 is high,
that is, the sound is close to a pressed sound at this time point. As can
be seen from Fig. 11, the waveform of the glottal flow at this time point
is close to a sawtooth wave, and much different from a sine wave. The
derivative waveform changes steeply.

Fig. 12 shows the estimated glottal flow waveform 280, derivative
282 thereof, and spectrum 284 of the estimated glottal flow waveform, at
the time point indicated by a dotted box 260 of Fig. 10. At the time
point corresponding to box 260 of Fig. 10, AQ 254 is low, that is, the
sound is close to a breathy sound at this time point. As can be seen from
Fig. 12, the waveform of the glottal flow at this time point is close to a
clear sine wave.

Using the apparatus described above, the speech data were actually
processed to extract pseudo-syllabic centers, and AQ of each
pseudo-syllabic center was calculated. Correlation between the listener's
impression when he/she hears the sound corresponding to such
pseudo-syllabic centers and AQs was investigated in the following manner.

Using the above-described apparatus, 22,000 centers of reliability
were extracted, and for each of the centers, the corresponding waveform
and AQ, as well as the RMS (Root Mean Square) energy (dB) of the original
speech waveform, were calculated. Of these centers of reliability, those
existing in the same syllabic nucleus and having approximately the same
AQs were combined, and further, among the centers of reliability, those
having an integrated unreliability value not lower than 0.2 were
disregarded. Consequently, the number of syllabic nuclei that were
considered usable as the auditory stimuli was reduced to slightly over
15,000.

Based on statistics computed over this data set, a subset of 60 stimuli
was selected to be used in a perceptual evaluation. In particular, for
each of the three emotion databases described above, five syllabic nuclei
were selected for each of four categories of AQ at their reliability
centers: extremely low; extremely high; around the mean of AQs for the
respective emotion minus one standard deviation (σ) of the distribution;
and around the mean of AQs plus one standard deviation.

The durations of the 60 quasi-syllabic nuclei selected in this manner
ranged from 32 msec to 560 msec, with a mean of 171 msec. Eleven
normal-hearing subjects participated in an auditory evaluation of these
short stimuli. The subjects listened to each stimulus as many times as
required over high-quality headphones in a quiet office environment, and
rated each on two separate 7-point scales, which were explained simply as
"perceived breathiness" and "perceived loudness", respectively. The
ratings of each subject were then proportionally normalized onto the range
of [0, 1]. These normalized scores were averaged across all 11 subjects to
obtain a mean value of breathiness and of loudness for each of the 60
stimuli.
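
The normalization and averaging of the ratings might be sketched as
follows. The exact meaning of "proportionally normalized" is not given, so
the sketch assumes that each subject's ratings are min-max scaled over
that subject's own range; the name normalize_and_average is hypothetical.

```python
import numpy as np

def normalize_and_average(ratings):
    """ratings: array of shape (subjects, stimuli) on a 7-point scale.

    Each subject's row is scaled onto [0, 1] using that subject's own
    minimum and maximum (an assumption), then rows are averaged to give
    one mean score per stimulus.
    """
    r = np.asarray(ratings, dtype=float)
    lo = r.min(axis=1, keepdims=True)
    hi = r.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return ((r - lo) / span).mean(axis=0)
```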

Fig. 13 is a scatter plot comparing the breathiness studied in the
above-described manner and the acoustically measured AQs. The linear
coefficient of correlation for these 60 pairs of values was found to be
0.77. While this correlation is not particularly high, it supports an
obvious trend that as the measured AQ increases, so too does the perceived
breathiness of the speech stimulus on average. A closer examination of
some of the points which lie furthest from an imaginary line of best fit
on the scatter plot revealed some potential causes of error: formant
discontinuities across the five frames, owing to a lack of dynamic
constraints; a higher degree of breathiness during a part of the syllabic
nucleus not included in the five frames; and strong influence of adjacent
nasality on the vocalic portion within the five frames.

Furthermore, it is interesting to note from Fig. 13 that there is a
greater range of perceived breathiness for those stimuli with a mid-to-low
AQ. It confirms the intuition that it is a more difficult task to rate the
breathiness of stimuli which are perhaps better characterized by either
modal or pressed phonation.

Though not shown in the figure, a scatter plot was also prepared to
compare the perceived loudness with the RMS energy measured in the
same reliability centers. The correlation was found to be 0.83, thus
confirming the strength of that relation despite not having used a more
sophisticated, perceptually weighted measure of loudness.

As described above, the present embodiment realized a method and
apparatus for (i) determining a position of a reliability center of
quasi-syllabic nuclei in recorded natural speech and for (ii) measuring
sound source attributes quantified by AQs proposed in Reference 2, without
necessitating any operator supervision. The result of voice perception
experiments performed by using the method and apparatus confirmed the
importance of AQ as a value that enables robust measurement and has strong
correlation with the perceived breathiness in the pseudo-syllabic nuclei.
In fact, though there were such error sources as described in the
foregoing, it could be confirmed that further study of AQ as a sound
quality parameter is necessary, because of the correlation found between
AQ and the perceived breathiness.

The embodiments as have been described here are mere examples
and should not be interpreted as restrictive. The scope of the present
invention is determined by each of the claims with appropriate
consideration of the written description of the embodiments and embraces
modifications within the meaning of, and equivalent to, the language of
the claims.

Industrial Applicability

The present method and apparatus enable automatic para-linguistic
labeling of speech units without operator supervision, facilitating
database construction. When continuous speech synthesis is performed
using the database of speech units with the desired labeling thus
realized, it becomes possible to realize a man-machine interface using
natural speech synthesis with a wide range of speech styles, ranging from
pressed sound through modal to breathy sound.